Presentation

Group 4

Introduction

  • Colorectal cancer (CRC) has the 3rd highest mortality rate among cancers

  • More effective methods to detect CRC are needed

  • Correlation between exosomes and tumorigenesis

  • miRNA and mRNA can serve as biomarkers: these are what we want to find!

Data retrieval - GSE

  • The data was loaded with the GEOquery library, so no files had to be downloaded manually

  • Both the primary data and the metadata were loaded

  • Data was already standardized
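
The loading step could look roughly like the sketch below. It assumes the series contains a single platform (one ExpressionSet). The helper name `load_gse` is hypothetical, and the accession is left as an argument because it is not stated here; `getGEO`, `exprs`, and `pData` are the usual GEOquery/Biobase calls.

```r
# Hypothetical helper: load a GEO series without manually downloading files.
load_gse <- function(accession) {
  # getGEO() fetches the series matrix and returns a list of ExpressionSets,
  # one per platform; this sketch assumes there is only one.
  gse <- GEOquery::getGEO(accession, GSEMatrix = TRUE)
  eset <- gse[[1]]
  list(
    expression = Biobase::exprs(eset),  # primary data: features x samples
    metadata   = Biobase::pData(eset)   # sample-level metadata
  )
}
```

Calling `load_gse()` with the series accession returns both tables in one list; it needs network access and the GEOquery and Biobase packages.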

Data description - GSE

Data analysis - GSE

  • Log2 Transformation:
    • Values below 0 were set to NaN, since log2 is undefined for them.
  • Design Matrix for Tumor vs Normal Comparison:
  • Limit Expressed Genes Based on Median Expression:
    • Only genes with above-median expression in at least 1/3 of the samples were retained.
  • Linear Model Fitting and Contrasts:
    • A linear model was fitted.
    • Contrasts were created between the two groups (Tumor and Normal).
    • An empirical Bayes step was performed to obtain moderated statistics and p-values.
  • Array Weights for Model Fitting:
    • Array weights (one per sample) were estimated and used when fitting the model.
  • Empirical Bayes Step (Again):
    • The empirical Bayes step was repeated on the model fitted with the array weights.
  • Tidying data
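
The steps above can be sketched with limma on simulated data. Everything about the data is an assumption made for illustration: the matrix is random, the first ten genes are artificially raised in the tumor group, and "above-median" is taken to mean above the overall median of the log2 matrix. This is not necessarily the exact code used in the project.

```r
library(limma)

set.seed(42)
n_genes <- 300
n_per_group <- 6

# Simulated intensities (assumption: stand-in for the real expression matrix).
expr <- matrix(
  rlnorm(n_genes * 2 * n_per_group, meanlog = 6, sdlog = 1),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0("S", seq_len(2 * n_per_group)))
)
group <- factor(rep(c("Normal", "Tumor"), each = n_per_group))
expr[1:10, group == "Tumor"] <- expr[1:10, group == "Tumor"] * 8  # spiked-in signal
expr[11, 1] <- -5                                                  # one invalid value

# Log2 transformation: values below 0 are set to NaN first.
expr[expr < 0] <- NaN
log_expr <- log2(expr)

# Keep genes that exceed the overall median in at least 1/3 of the samples.
cutoff <- median(log_expr, na.rm = TRUE)
keep <- rowSums(log_expr > cutoff, na.rm = TRUE) >= ncol(log_expr) / 3
log_expr <- log_expr[keep, ]

# Design matrix and Tumor vs Normal contrast.
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)
contrast <- makeContrasts(TumorVsNormal = Tumor - Normal, levels = design)

# First pass: linear model, contrast, empirical Bayes.
fit <- eBayes(contrasts.fit(lmFit(log_expr, design), contrast))

# Second pass: estimate array weights, refit with them, empirical Bayes again.
aw <- arrayWeights(log_expr, design)
fit_w <- eBayes(contrasts.fit(lmFit(log_expr, design, weights = aw), contrast))

# Full table of statistics for the contrast.
results <- topTable(fit_w, coef = "TumorVsNormal", number = Inf)
```

Samples with lower estimated quality receive smaller weights, so they influence the second fit less than in the unweighted first pass.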

Data retrieval - TCGA (Mathilde)

  • Fetch analyte.tsv & clinical.tsv from raw_/

    • Obtain IDs of the patients for whom the RNA expression was registered
  • The TCGAbiolinks library is used to retrieve data from the GDC data portal

    • Function: retrieve_and_prepare_data()

      • GDCquery: Query to specify the data to get

      • GDCdownload: Downloading the samples from the query

  • Example:

miRNA_data_cancer <- retrieve_and_prepare_data(
  project = "TCGA-COAD",
  data_category = "Transcriptome Profiling",
  data_type = "miRNA Expression Quantification",
  workflow_type = "BCGSC miRNA Profiling",
  id_cancer_patients = id_cancer_patients_cancer,
  directory_prefix = "samples_miRNA"
)

  • miRNA data - 2 separate dataframes

  • mRNA data - Large SummarizedExperiment
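
For orientation, a wrapper of this kind might be structured as below. This is a hypothetical reconstruction, not the project's actual function: the argument names are copied from the example call, and the final `GDCprepare` step is an assumption based on the "prepare" in the function name. `GDCquery`, `GDCdownload`, and `GDCprepare` are existing TCGAbiolinks functions.

```r
# Hypothetical sketch of the helper; the project's real function may differ.
retrieve_and_prepare_data <- function(project, data_category, data_type,
                                      workflow_type, id_cancer_patients,
                                      directory_prefix) {
  # Build the query describing which files to fetch from the GDC.
  query <- TCGAbiolinks::GDCquery(
    project       = project,
    data.category = data_category,
    data.type     = data_type,
    workflow.type = workflow_type,
    barcode       = id_cancer_patients
  )
  # Download the files matched by the query into the chosen directory.
  TCGAbiolinks::GDCdownload(query, directory = directory_prefix)
  # Assumption: the downloaded files are then read into R with GDCprepare().
  TCGAbiolinks::GDCprepare(query, directory = directory_prefix)
}
```

Namespace-qualified calls keep the sketch free of a global `library()` call; running it requires TCGAbiolinks and network access.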

Data description - TCGA (Mathilde)

  • Data tidying:
    • “Like families, tidy datasets are all alike but every messy dataset is messy in its own way” (Hadley Wickham)
  • Data visualizations:
    • Number of cancer vs. normal samples
    • Gender distribution
    • Pathological cancer stages

Data preprocess - TCGA (Ksenia)

  • Creating metadata that maps each patient ID to its tissue status (TCGA ID - Tissue Status), for both mRNAs and miRNAs.
  • Log2 transformation of the two datasets and reorganization so that both share the same tidy layout: rows = gene names, columns = patient IDs, values = expression data.
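
A small sketch of both steps on toy data. Two things are assumptions: that the tissue status is derived from the sample-type code in the TCGA barcode, and that a pseudo-count of 1 is added before the log2 transformation. The barcodes and counts are made up for illustration.

```r
# Illustrative barcodes in the TCGA format (not taken from the project data).
barcodes <- c("TCGA-XX-0001-01A", "TCGA-XX-0002-01A", "TCGA-XX-0003-11A")

# Characters 14-15 of a TCGA barcode hold the sample-type code:
# 01-09 are tumor samples, 10-19 are normal samples.
sample_code <- as.integer(substr(barcodes, 14, 15))
metadata <- data.frame(
  TCGA_ID = barcodes,
  tissue_status = ifelse(sample_code < 10, "Tumor", "Normal"),
  stringsAsFactors = FALSE
)

# Toy count matrix in the shared layout: rows = gene names, columns = patient IDs.
counts <- matrix(
  c(0, 3, 7, 15, 31, 63),
  nrow = 2,
  dimnames = list(c("hsa-mir-1", "hsa-mir-2"), barcodes)
)

# Assumption: a pseudo-count of 1 keeps zero counts finite after log2.
log_counts <- log2(counts + 1)
```

Because both the mRNA and miRNA matrices end up with patient IDs as column names, the same `metadata` table can be joined to either one.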

Data augmentation - TCGA, edgeR

  • Calculation of the normalization factors for the data (_log) with calcNormFactors and imputation of NAs using means.

  • Running a “universal” edgeR differential analysis function with a quasi-likelihood model.
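
A minimal sketch of an edgeR quasi-likelihood analysis on simulated counts. It is not the project's "universal" function: the data are random, the sketch starts from raw counts because edgeR models counts, and replacing NAs with the rounded row mean is only one possible reading of "imputation using means".

```r
library(edgeR)

set.seed(1)
n_genes <- 200
n_per_group <- 6

# Simulated read counts (assumption: stand-in for the real miRNA counts).
counts <- matrix(
  rnbinom(n_genes * 2 * n_per_group, mu = 100, size = 5),
  nrow = n_genes,
  dimnames = list(paste0("mirna", seq_len(n_genes)),
                  paste0("S", seq_len(2 * n_per_group)))
)
group <- factor(rep(c("Normal", "Tumor"), each = n_per_group),
                levels = c("Normal", "Tumor"))
counts[1:10, group == "Tumor"] <- counts[1:10, group == "Tumor"] * 8L  # spiked-in signal
counts[50, 2] <- NA                                                    # one missing value

# Impute each NA with the rounded mean of its row (assumed imputation rule).
na_idx <- which(is.na(counts), arr.ind = TRUE)
counts[na_idx] <- round(rowMeans(counts, na.rm = TRUE)[na_idx[, "row"]])

# Normalization factors, dispersion estimation, quasi-likelihood fit and test.
dge <- DGEList(counts = counts, group = group)
dge <- calcNormFactors(dge)
design <- model.matrix(~ group)
dge <- estimateDisp(dge, design)
fit <- glmQLFit(dge, design)
qlf <- glmQLFTest(fit, coef = 2)  # coefficient 2 = Tumor vs Normal

# Table with logFC, logCPM, F, PValue and FDR for every feature.
stats <- topTags(qlf, n = Inf)$table
```

The resulting `stats` table has the same statistic columns as the table shown next.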

Statistics table:

# A tibble: 1,881 × 6
   miRNA_ID      logFC logCPM     F   PValue      FDR
   <chr>         <dbl>  <dbl> <dbl>    <dbl>    <dbl>
 1 hsa-mir-135b   4.83   11.5  245. 4.11e-19 7.73e-16
 2 hsa-mir-19b-2  2.99   11.6  235. 2.41e-15 1.81e-12
 3 hsa-mir-590    4.43   11.2  154. 2.88e-15 1.81e-12
 4 hsa-mir-374a   2.14   12.0  279. 6.03e-15 2.84e-12
 5 hsa-mir-450b   5.22   11.0  191. 4.85e-14 1.57e-11
 6 hsa-mir-19a    5.75   11.4  159. 5.02e-14 1.57e-11
 7 hsa-mir-889    3.95   11.2  200. 1.41e-13 3.78e-11
 8 hsa-mir-19b-1  2.37   11.7  308. 1.22e-12 2.86e-10
 9 hsa-mir-708    2.61   11.5  184. 2.13e-12 4.45e-10
10 hsa-mir-96     3.24   11.2  113. 3.38e-12 6.36e-10
# ℹ 1,871 more rows

Final augmented dataset:

# A tibble: 75,240 × 3
   miRNA_ID     TCGA_ID          log_reads
   <chr>        <chr>                <dbl>
 1 hsa-let-7a-1 TCGA-F4-6854-01A      14.5
 2 hsa-let-7a-1 TCGA-AA-A00O-01A      13.0
 3 hsa-let-7a-1 TCGA-DM-A28F-01A      15.4
 4 hsa-let-7a-1 TCGA-NH-A6GC-01A      15.2
 5 hsa-let-7a-1 TCGA-AA-A010-01A      13.8
 6 hsa-let-7a-1 TCGA-AA-A00D-01A      12.4
 7 hsa-let-7a-1 TCGA-AA-A00U-01A      11.3
 8 hsa-let-7a-1 TCGA-D5-6922-01A      15.7
 9 hsa-let-7a-1 TCGA-G4-6323-01A      15.2
10 hsa-let-7a-1 TCGA-F4-6703-01A      14.5
# ℹ 75,230 more rows

Results (All)

  • Volcano Plot & Heatmap:
    • Added a column ‘diffexpressed’ with the values NO, UP, and DOWN.
    • Added a column ‘label’ with the GENE_IDs of overexpressed genes.
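
The two columns can be derived as in the sketch below. The cut-offs (|logFC| > 1 and FDR < 0.05) are assumptions, since the thresholds actually used are not stated, and the table is toy data.

```r
# Toy results table (illustrative values, not project results).
res <- data.frame(
  GENE_ID = c("geneA", "geneB", "geneC", "geneD"),
  logFC   = c(2.5, -1.8, 0.3, 3.1),
  FDR     = c(0.001, 0.01, 0.20, 0.30),
  stringsAsFactors = FALSE
)

# Assumed cut-offs; the presentation does not state the ones actually used.
lfc_cutoff <- 1
fdr_cutoff <- 0.05

# Classify every gene as NO, UP, or DOWN.
res$diffexpressed <- "NO"
res$diffexpressed[res$logFC >  lfc_cutoff & res$FDR < fdr_cutoff] <- "UP"
res$diffexpressed[res$logFC < -lfc_cutoff & res$FDR < fdr_cutoff] <- "DOWN"

# Label only the overexpressed genes; the rest get NA and stay unlabelled.
res$label <- ifelse(res$diffexpressed == "UP", res$GENE_ID, NA)
```

In a volcano plot, `diffexpressed` would typically drive the point colour and `label` the text annotations.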

[Figure panels: TCGA mRNA, TCGA miRNA, GSE miRNA]

Conclusions

  • The TCGA and GSE datasets cover different cancer stages, and in the data we used the number of samples differs between stages.

  • Although we were able to follow the article’s instructions, our results differ significantly from the article’s. The differences might stem from extra steps the authors took during data preprocessing, or from the sparse information they provide. It would be wise to contact the authors and ask about preprocessing and data retrieval. Overall, our analysis was carried out accurately, and the results did not indicate any grave errors.